Runbook: Certificate Expiry

Alert

Prometheus Alert: CertManagerCertExpirySoon / CertManagerCertNotReady
Grafana Dashboard: cert-manager dashboard
Firing condition: Certificate will expire within 30 days, or Certificate resource is in a not-ready state

Severity

Critical -- Expired certificates cause TLS failures across the platform, breaking ingress traffic, inter-service communication, and webhook connectivity.

Impact

Istio ingress gateway stops accepting HTTPS connections
Webhook services (Kyverno, cert-manager) fail validation/mutation
Internal mTLS certificates may fail rotation, degrading mesh communication
Harbor, Grafana, Keycloak UI access via Istio gateway breaks

Investigation Steps

List all certificates and their status:

kubectl get certificates -A

Check for certificates that are not ready or near expiry:

kubectl get certificates -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter,RENEWAL:.status.renewalTime'

Describe the failing certificate for detailed condition messages:

kubectl describe certificate <certificate-name> -n <namespace>

Check cert-manager controller logs for errors:

kubectl logs -n cert-manager deployment/cert-manager --tail=100

Check the CertificateRequest resources associated with the failing certificate:

kubectl get certificaterequest -n <namespace>
kubectl describe certificaterequest <name> -n <namespace>

Check the Order and Challenge resources (for ACME/Let's Encrypt issuers):

kubectl get orders -A
kubectl get challenges -A

Verify the ClusterIssuer or Issuer is ready:

kubectl get clusterissuers
kubectl describe clusterissuer <issuer-name>

Check cert-manager webhook health:

kubectl get pods -n cert-manager
kubectl logs -n cert-manager deployment/cert-manager-webhook --tail=50

Check the cert-manager HelmRelease status:

flux get helmrelease cert-manager -n cert-manager

Resolution

Certificate stuck in not-ready state

Delete the failing CertificateRequest to trigger a new one:

kubectl delete certificaterequest <name> -n <namespace>

Force cert-manager to re-issue by adding a temporary annotation:

kubectl annotate certificate <name> -n <namespace> cert-manager.io/issue-temporary-certificate="true" --overwrite

Then remove it to trigger the real issuance:

kubectl annotate certificate <name> -n <namespace> cert-manager.io/issue-temporary-certificate-

ClusterIssuer not ready (self-signed CA)

Check the CA secret exists:

kubectl get secret -n cert-manager | grep ca

If the root CA secret is missing, recreate it. Check the ClusterIssuer spec for the expected secret name:

kubectl describe clusterissuer sre-ca-issuer

Re-apply the cert-manager manifests via Flux:

flux reconcile helmrelease cert-manager -n cert-manager

ACME challenge failure (Let's Encrypt)

Check challenge status:

kubectl describe challenge <name> -n <namespace>

Verify DNS is resolving correctly for the domain
Verify the HTTP-01 solver can reach the challenge endpoint (check Istio gateway and VirtualService)

Manual certificate renewal

Delete the existing secret to force re-issuance:

kubectl delete secret <tls-secret-name> -n <namespace>

cert-manager will detect the missing secret and re-issue automatically

cert-manager pods not running

Check the HelmRelease:

flux get helmrelease cert-manager -n cert-manager

If the release is in a failed state, suspend and resume:

flux suspend helmrelease cert-manager -n cert-manager
flux resume helmrelease cert-manager -n cert-manager

Force reconciliation:

flux reconcile helmrelease cert-manager -n cert-manager --with-source

Prevention

Monitor the certmanager_certificate_expiration_timestamp_seconds metric in Prometheus
Set alerting thresholds at 30 days (warning) and 7 days (critical) before expiry
Ensure cert-manager has sufficient RBAC to create/update secrets in all namespaces
Test certificate renewal in a staging environment before production
Keep cert-manager updated via the Flux HelmRelease version pin (currently 1.14.4)
Verify ClusterIssuer health after any cert-manager upgrade

Escalation

If certificate issuance fails after multiple retries: check upstream issuer (Let's Encrypt rate limits, internal CA health)
If cert-manager pods are crash-looping: escalate to platform team lead
If the issue affects Istio gateway TLS termination: this is a P1 incident affecting all external traffic

🕸️ Ada Research Browser

Runbook: Certificate Expiry

Alert

Severity

Impact

Investigation Steps

Resolution

Certificate stuck in not-ready state

ClusterIssuer not ready (self-signed CA)

ACME challenge failure (Let's Encrypt)

Manual certificate renewal

cert-manager pods not running

Prevention

Escalation